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Abstract 

Given n noisy observations g{ of the same quantity /, it is common 
use to give an estimate of / by minimizing the function ~ /) 2 - 

From a statistical point of view this corresponds to computing the 
Maximum Likelihood estimate, under the assumption of Gaussian 
noise. However, it is well known that this choice leads to results 
that are very sensitive to the presence of outliers in the data. For 
this reason it has been proposed to minimize functions of the form 
i V(gi — /), where V is a function that increases less rapidly than 
the square. Several choices for V have been proposed and success¬ 
fully used to obtain “robust” estimates. In this paper we show that, 
for a class of functions V, using these robust estimators corresponds 
to assuming that data are corrupted by Gaussian noise whose vari¬ 
ance fluctuates according to some given probability distribution, that 
uniquely determines the shape of V. 
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1 Introduction 


A common problem in statistics is the following: given n noisy observa¬ 
tions gi of the same quantity /, give an estimate of /. A typical solution 
to this problem consists in choosing the value of / that maximes the like¬ 
lihood function P(g\f), that is the probability of having observed the data 
g = (</i,. . . ,g n ) if the true value was /. Estimates of this type are named 
Maximum Likelihood (ML) estimates, and rely on the assumption that we 
know the likelihood function P(g i,. . . ,g n \f), that is essentially a model of 
how noise affected the measure process. 

A common assumption is that of additive Gaussian noise, in which we 
assume that the measurement gi are related to the true value by the relation 

9* = f + Q , * = l,...,n , 

where e 4 - are independent random variables with given gaussian probability 
distributions P 8 (e 8 ) of variance cq and zero mean. In this case the likelihood 
function is 


n n IR 

p(9u----3r,\f) = n^( £ .) = n\/-e- a(9, - ,)J (i) 

i Vr V 7T 
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where /3; = Maximizing the likelihood function (1) corresponds therefore 

° i 

to solve the following minimization problem: 


N 

min J2 ~ f) 2 ■ ( 2 ) 

* i=l 

An elementary computation shows that the solution is the weighted av¬ 
erage of the data: 


r E" = l Pi9i 

1 Ef=i A ' 

The ML estimate has therefore a simple meaning and it is easy to com¬ 
pute. However, it is well known that estimates of this type are not “robust”, 
that is are they very sensitive to the presence of outliers in the data. In order 
to overcome this difficulty it has been proposed to use a modified version of 
the minimization problem (2): 
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m f in 12 V (ffi ~ /) > 
* 8 = 1 
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Figure 1: Different choices for the function V. (1) G(,r) is quadratic for 
x < 0.5 and then constant. (2) G(,r) = |,r|. (3) G(,r) is quadratic for x < 0.5 
and then linear, with continuous first derivative. 


where the quadratic function (g t — f) 2 has been substituted by some other 
less rapidly increasing even function V. Estimator of this type are known 
in statistics, for particular choices of V, as robust estimators (Huber, 1981). 
The idea underlying (3) is that if the error (g t — f) 2 is large, it is likely that gi 
is an outlier, so that we do not want to enforce / to be close to it. Therefore 
the function V should not increase much after a certain value. Different 
shapes for V have been proposed, and some of them have been depicted in 
figure (1). 

In this paper we want to give a more rigorous justification for the use of es¬ 
timates like the one of ecp (3), and also to give an interpretation of the model 
of noise to which they correspond. We will see that if the function e -1- G) 
is completely monotone, then using ecp (3) corresponds to assuming that 
our measures are affected by a Gaussian noise whose variance is a random 
variable with given probability distribution. Depending on the probability 
distribution of the variance of the noise, different shapes for V are obtained. 
For a particular choice of V a justification of such a technique was given in 
(Girosi, Poggio and Caprile, 1991), but no characterization was given. In the 
next section we formalize these statements, while in the following sections 
we present a large class of functions V that can be used, together with some 
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examples. 


2 Robust Maximum Likelihood Estimates 

In order to simplify the notation we consider the problem presented in the 
previous section in which only one measurement g is done, since this does 
not change the main conclusions. We therefore assume that 

9 = f + e (4) 

where e is a random variable whose distribution is Gaussian with zero mean 
and variance a. The likelihood function is therefore 





(5) 


where [3 = 

When we compute the standard maximum likelihood estimate we are as¬ 
suming that the variance of the noise has a fixed value, but this assumption 
is not always realistic. In fact, in many cases the accuracy of the measure¬ 
ment apparatus can fluctuate, due to some external causes, and in these 
cases our data can contain outliers. A more realistic assumption consists in 
considering the variance of the Gaussian noise, and therefore /3, as a random 
variable, with given distribution P(f3). We are therefore led to introduce the 
probability P(g\f } /3) of having observed the data g if the true value was / 
and the variance of the noise was a = =: 


P(g\f,P) 



( 6 ) 


Notice that the right hand side of eq. (6) is the same of eq. (5), but their 
meaning is different. We can now compute the joint probability P(g } f3\f) 
of having observed the data g in presence of gaussian noise with variance 
a = -4?=, if the true value was /: 


P(gJ\f) = P(g\fJ) P(ff) . (7) 

Since we are not interested in estimating /3, but we are interested only 
in the probability of g given /, that is our likelihood function, we integrate 
equation (7) over f3 to obtain the effective noise distribution 
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P’W) = r p (»l/. /3) WW = H=r e- n ’-‘ r \[f> W W (8) 

Jo 7T Jo v 

The MAP estimate is now obtained by maximizing the probability of eq. 
(8), or, taking the negative of its logarithm, solving the following minimiza¬ 
tion problem: 


mm V(g-f) (9) 

where we have defined the so called effective potential 1 

V{x) = -\n j™ e-^y/p P(P)dp . (10) 

In the case in which n observations gi,. . . , g n have been taken the same 

considerations apply, and assuming that the variances cq of the measurements 
gi have all the same probability distribution, we obtain, instead of eq. (9): 

N 

mm ^2 v (9i~f)- (H) 

J i = l 

This equation coincides with eq. (3), that has been proposed has a tech¬ 
nique to “robustize” the least-square estimate (2). In our case, however, 
the effective potential V derives from specific assumptions on how data are 
corrupted by noise. If the distribution of the random variable f3 is a delta 
function centered on some value ff that is if P(f3) = S(f3 — f3) } the noise model 
is Gaussian with fixed variance, and the effective potential is a quadratic func¬ 
tion, yielding the same result of eq. (2). For other probability distributions 
P(/3), formula (10) allows to compute the corresponding effective potential 
by simply performing a one dimensional integration. Conversely, in some 
cases, given an effective potential V(x), it is also possible to understand if 
there is any probability distribution P(f3) that corresponds to it. In the next 
section we introduce a class of effective noise distributions for which such a 
characterization can be given. 

3 A class of effective noise distributions 

In this section we study and characterize a class of effective noise distri¬ 
butions. Since we want to maximize the effective noise distribution (8) we 

Mhis name was previously introduced by Geiger and Girosi (1991), that used a similar 
technique applied at the problem of surface reconstruction with discontinuities. 
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are not interested in effective distributions that are unbounded. It will turn 
out that if an effective noise distribution is bounded at the origin it is also 
bounded on all the real axis. Therefore, according to eq. (8) we define the 
bounded effective noise distributions as the probability distributions of the 
form: 


/(. t ) = dp (12) 

where P(ff) is a probability distribution, and such that the following condi¬ 
tion is satisfied: 


/(0) = yfpP(f3)df3 < +oo . 

We can now prove the following proposition: 

Proposition 3.1 A probability distribution f(x) is a bounded effective noise 
distributions if and only if f(ffx) is completely monotone. 

Proof: (only if) Suppose f(x) is a bounded effective noise distribution. 
Then f(ffx) it can be represented as 

poo 

f(y^) = / e-^dy(ff) 

Jo 

where 

dil 3 ) = [ VrP(T)dT . 

Jo 

Since y(/3) is clearly non decreasing and bounded, then by the Bern¬ 
stein’s theorem on the representation of completely monotone functions (see 
Appendix A), f(ffx) is completely monotone. 

(if) Suppose that the probability distribution f(x) is such that f(ffx) is 
completely monotone. Then it can be represented as 

f(x)= e-' 3 'cHS). (13) 

Jo 

with y(ff) non decreasing and bounded. Since / is a probability distribution 
its integral over the real axis has unit value, and therefore 

/ + CO POO / POO \ 

f(x) = 2/ / e~ fjx dfi(P) )dx 

-oo Jo \J 0 / 
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Exchanging the order of integration and evaluating the gaussian integral, 
we obtain that 


Mft) _ 

Jo y/]3 


0 < c < +oo . 


Therefore it is always possible to write 


dfi(/3) = P(f3) d/3 , 

where P(/3) is a probability distribution, being positive and having finite 
integral. Substituting this expression in formula (13) we obtain the rep¬ 
resentation of eq. (12). Noticing that completely monotone functions are 
bounded at the origin, since 


roo 

/( 0 ) = / dfi(P) < Too , 

Jo 

we conclude that f(x) is a bounded effective noise distribution.□ 

We can now answer to the question if effective potentials of the type V(x) = 
\x\ p can be derived in this framework. In fact, using the previous proposition 
it is sufficient to check if the probability distribution P(x) = e - ^ P is such 
that P(yPc) is completely monotone. Using the fact that the function e~^ P 
is completely monotone if and only if 0 < p < 1 (Schoenberg, 1937)(see 
appendix A), we can immediately derive the following proposition: 

Proposition 3.2 The function V(x) = \x\ p is the effective potential asso¬ 
ciated to a bounded effective noise distribution if and only if 0 < p < 2. 


We notice that if we set p = 1 in the proposition above we obtain as 
effective potential the usual L\ error measure, that is V(x) = |x|, is obtained. 
However, since the function absolute value is not differentiable at the origin 
it has been proposed to use functions that behave quadratically in a neighbor 
of the origin, and linearly for large values of the argument (Eubank, 1988). 
Effective potentials of the form V(x) = \x\ p are interesting, since they are 
convex and the problem of maximizing the likelihood function has therefore 
only one solution. However, before showing what are the effective noise 
distributions that are associated to this effective potentials, we present a 
more simple example, that gives a non convex effective potential that has 
also been used in practice. 
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4 A class of non convex effective potentials 

We have already seen that if the distribution P(P) is a delta function the 
standard quadratic potential is obtained. The simplest non trivial case con¬ 
sists in assuming that P(P) is a sum of two delta functions, that is 

P(P) = (1 - e)8(P - ft) + e8(P - ft) (14) 

where e is a parameter between 0 and 1 and ft, ft are fixed positive numbers, 
ft > ft- If ft is a very small number such a distribution can represent the a 
priori knowledge that a fraction e of the data is very unreliable. In the limit 
of ft going to zero this fraction of data is constituted by genuine outliers, 
and we therefore analyze the model keeping in mind that we are interested 
in this limit. 

With the noise distribution given by eq. (14) the effective potential be¬ 
comes 


V(x) = - in j™ [(1 - e)8(P - ft) + e8(P - ft)]d/3 (15) 

and, after some algebra: 

V(x) = ftft -In (l + —,/5 e - 2 (A-^A ^ ( 16 ) 

V 1 - e V ft / 

where we have neglected unimportant constant terms. 

We start studying the behavior of the potential in a neighbor of the origin. 
Taking a Taylor’s expansion up to the second order, after some algebra we 
find that 


V(x) = V(0 ) + ftft + oft 3 ) 

so that the potential is initially quadratic, and very flat if ft is small, that 
is if we assume that outliers are present in the data. 

When x goes to infinity the exponential term in the logarithm of eq. (16) 
grows very fast, (remember that ft > ft), and the unit term can be omitted, 
leading to major simplifications. This is true only if ft is “not to small”, in 
the sense that the following inequality has to verified: 

ft >> fte, ft) - i In ft (17) 

where &(e, ft) is a constant that depends only e and ft, whose exact form is 
irrelevant to us. In the region where this condition is satisfied we therefore 
obtain: 
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beta2=0.01 - beta2=0.05 - beta2=0.10 


Figure 2: The non convex effective potential of the model above for different 
values of /i 2 . 


V(x) ss fiix 2 — In A’(e,/ii) — -ln/i 2 — .r 2 (/ii — 2 ) 

and therefore: 


V(x ) sa /3 2 x 2 - In k( e, In /i 2 . 

For large values of x and small values of /i 2 , where “large” and “small” 
have to be intended in the sense of condition (17), the effective potential is 
again quadratic and very flat. 

In summary: for small values of x the potential is a very flat parabola, for 
large values of x is the same parabola, but translated of a positive amount 
that grows logarithmically with /i 2 , and in between, since its first derivative 
is strictly positive, it smoothly connects these two behaviors. In fig. (2) we 
show the shape of the effective potential for fixed e and /ii, for three differ¬ 
ent values of /i 2 . We set e = 0.1, j3\ = 4, and /i 2 £ {0.1, 0.05, 0.01}. This 
amounts to say that we know a priori that 90% of the data points are affected 
by Gaussian noise of variance equal to 0.5 (that is \J\)- The other 10%) is af¬ 
fected by Gaussian noise with very large variance, that is a = 3.16, 4.47, 10. 
We notice that for a value of /i 2 = 0.01, that corresponds to a variance 
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a = 10, the effective potential is extremely flat, almost constant. A similar 
behavior is expected: in fact it means that when the interpolation error is 
larger than a threshold its influence on the solution is not taken in account 
anymore, and this is exactly the kind of motivation that led statisticians to 
consider robust models. 


5 A class of convex effective potentials 

We now consider an effective potential of the form 

V(x) = \/ a 2 + x 2 . (18) 

where a is some given parameter, possibly zero. Functions of this type are 
well known in approximation theory by the name of “multiquadrics”, or 
“Hardy’s multiquadrics”, and their behavior is shown in fig. (3). Potentials 
of this shape are interesting because they are convex, so that the minimiza¬ 
tion problem associated with them has a unique solution. Moreover, poten¬ 
tials with a shape very similar to this one can be implemented in analog 
VLSI circuits (Harris, 1990), allowing very fast ways to solve the estimation 
problems. 

We are interested in finding the probability distribution that leads to this 
form of effective potential. A solution to this problem certainly exists, since 
it is easy to show that e~ v ^' > is completely monotonic. Therefore we have 
to find a function P(/3) such that 

e ~Va 2 +x 2 = d/3, 

This is in essence the problem of computing an inverse Laplace transform. 
We start from the following identity (Gradshteyn and Ryzhik, 1981): 

POO o 1 

2y^e~^ = [r 2 e--pe~ xl3 d[3 (19) 

Jo 

and perform the substitution x —> x 2 + a 2 , obtaining 

2y^e~ Vx2+a2 = [°° p-h~^e~ px2 - pa2 dp . ( 20 ) 

Jo 

Making the proper identifications in equation above, and paying attention 
to normalization factors, we obtain as a result: 

P(P) = j 2 e-^- p ° 2+a • (21) 
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Figure 4: The distribution of variance P{cr) associated to the multiquadric 
effective potential. 


Wo can now derive the distribution P{cr) of the variance a 
ing 


-fa' impos - 


P(/3)d/3 = P(a)dcr , f3 = — . 

a 2 

After some algebra we obtain the function 


( 22 ) 


P{cr) = 2cre 4 ct 2 +Q . 

whose shape is depicted in fig, (4) for three different values of a. We no¬ 
tice that when a increases the distribution becomes more peaked, and also 
flatter around the origin. Therefore the probability of having low-noise data 
decreases when a increases. Equivalently, we can also say that the proba¬ 
bility of having data with noise larger than a given threshold increases with 
a. 


6 Conclusions 

We have shown that it is possible to give a simple interpretation to estimators 
based on the solution of the minimization problem 

N 

m i n - ’ ( 23 ) 

4 i=l 
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where V is an appropriate function, that we call effective potential. If the 
function e~ v (^ is completely monotone, using these robust estimators cor¬ 
responds to compute Maximum Likelihood estimators under the assumption 
that data are corrupted by Gaussian noise whose variance fluctuates accord¬ 
ing to a given probability distribution, that uniquely determines V. Typical 
“effective potentials” V, that have been used in the past, belongs to the class 
we consider. 

We notice that the result we derived holds also in the more general settings 
context of parametric and non parametric regression. In order to see why, 
let g = {(xyj/i) G R n X R}iL\ be a set of data that has been obtained 
by randomly sampling a multivariate function / in presence of noise. In 
parametric regression we assume that / is a parametric function h(x;p), 
where p £ i? m , and the optimal set of parameters p is usually recovered by 
minimizing the least square error 

n 

™ n m Efe-M x .;p)) 2 - ( 24 ) 

pfc i=\ 

As in the case considered in this paper, this can be thought as a maximum 
likelihood estimator, under the assumption of Gaussian noise with fixed vari¬ 
ance. Therefore the same argument we applied in section (2) applies here, 
and more robust estimates could be obtained if we replace the quadratic 
function in eq. (24) with an effective potential V. 

In non parametric regression no assumption is made on the specific form 
of /, and a common technique consists in solving the following minimization 
problem: 


N 

min XX/( x 0 “ Vif + XS \.f1 ■ 

* i=l 

where S[f] is an appropriate convex functional of / and A a positive number. 
This correspond to compute the Maximum A Posteriori estimator, under 
the assumption of Gaussian noise and a priori probability for / given by 
P(f) oc e -A5 W. If we assume that the variance of the Gaussian noise is a 
random variable, using the same argument we used in section (2) we can prove 
that the Maximum A Posteriori estimator solves the following minimization 
problem: 


N 

m , in 5>(/(x0 

A 8 = 1 


Vi) + \S[f\ 


(25) 
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where V is an effective potential. Estimators of this type are known in the 
statistical literature as M-type smoothing splines (Eubank, 1988), and their 
implementation in analog VLSI circuits has been considered by J. Harris 
(1991) for some choices of the functional S[f]. 

A Completely Monotone Functions 

We need to give the following: 

Definition A.l A function f is said to be completely monotonic on (0,oo) 
provided that it is C°°(0,oo) and (—l) , |^j(x) > 0, Vr G (0, oo) ; V/ £ Af, 
where AT is the set of natural numbers. 

A typical example of completely monotone function is the exponential 
function f(x) = e~ ax , with a > 0. It turns out that all the completely mono¬ 
tone functions are linear superpositions with positive coefficients of scaled 
exponentials, as the following theorem of Bernstein shows: 

Theorem A.l (Bernstein, 1929) The class of completely monotone func¬ 
tions is identical with the class of functions of the form 

roo 

g(x) = / e~ px d/i(f3), 

Jo 

where y(/3) is non-decreasing and bounded for f3 > 0. 
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